For more info on Myria, access to the demo cluster, and setting up a cluster: http://myria.cs.washington.edu/

Connecting to Myria



In [ ]:

    
# myria-python functionality
from myria import *

# myriaL cell functionality
%load_ext myria

# connection for myria-python functionality
connection = MyriaConnection(rest_url='http://localhost:8753')
# same as: http://ec2-52-36-55-94.us-west-2.compute.amazonaws.com:8753

# connection for myriaL cell functionality
%connect http://localhost:8753

MyriaL Basics

First you would scan in whatever tables you want to use in that cell. These are the tables visible in the Myria-Web datasets tab.

R1 = scan(cosmo8_000970);

This put all of the data in the cosmo8_000970 table into the relation R1 which can now be queried with MyriaL

R2 = select * from R1 limit 5;

Once we have a relation, we can query it, and store the result in a new relation. Sometimes you just want to see the output of the cell you're running, or sometimes you want to store the result for later use. In either case, you have to store the relation that you want to see or store the output of, because otherwise Myria will optimize the query into an empty query.

store(R2, MyInterestingResult);

This will add MyInterestingResult to the list of datasets on Myria-Web. If you are running multiple queries and want to just see their results without storing multiple new tables, you can pick a name and overwrite it repeatedly:

%%query
    ...
    store(R2, temp);
...
query%%
    ...
    store(R50, temp);

All statements need to be ended with a semicolon!

Also, note that a MyriaL cell cannot contain any Python. These cells are Python by default, but a MyriaL cell starts with %%query and can only contain MyriaL syntax.



In [ ]:

    
%%query
-- comments in MyriaL look like this 
-- notice that the notebook highlighting still thinks we are writing python: in, from, for, range, return

R1 = scan(cosmo8_000970);
R2 = select * from R1 limit 5;
R3 = select iOrder from R1 limit 5;
store(R2, garbage);



In [ ]:

    
%%query
-- there are some built in functions that are useful, just like regular SQL:
cosmo8 = scan(cosmo8_000970);
countRows = select count(*) as countRows from cosmo8;
store(countRows, garbage);



In [ ]:

    
%%query
-- lets say we want just the number of gas particles
cosmo8 = scan(cosmo8_000970);
c = select count(*) as numGas from cosmo8 where type = 'gas';
store(c, garbage);



In [ ]:

    
%%query
-- some stats about the positions of star particles
cosmo8 = scan(cosmo8_000970);
positionStats = select min(x) as min_x
        , max(x) as max_x
        , avg(x) as avg_x
        , stdev(x) as stdev_x
        , min(y) as min_y
        , max(y) as max_y
        , avg(y) as avg_y
        , stdev(y) as stdev_y
        , min(z) as min_z
        , max(z) as max_z
        , avg(z) as avg_z
        , stdev(z) as stdev_z
    from cosmo8
    where type = 'star';
store(positionStats, garbage);



In [ ]:

    
# we can also create constants in Python and reference them in MyriaL
low = 50000
high = 100000
destination = 'tempRangeCosmo8'



In [ ]:

    
%%query
-- we can reference Python constants with '@'
cosmo8 = scan(cosmo8_000970);
temps = select iOrder, mass, type, temp
    from cosmo8
    where temp > @low and temp < @high
    limit 10;

store(temps, @destination);

User Defined Functions

In MyriaL we can define our own functions that will then be applied to the results of a query. These can either be written in Python and registered with Myria or they can be written directly within a MyriaL cell (but not in Python).

When registering a Python function as a UDF, we need to specify the type of the return value. The possible types are the INTERNAL_TYPES defined in raco.types as seen here

Currently a function signature can't be registered more than once. In order to overwrite an existing registered function of the same signature, you have to use the Restart Kernel button in the Jupyter Notebook toolbar.



In [ ]:

    
from raco.types import DOUBLE_TYPE
from myria.udf import MyriaPythonFunction

# each row is passed in as a tupl within a list
def sillyUDF(tuplList):
    row = tuplList[0]
    x = row[0]
    y = row[1]
    z = row[2]
        
    if (x > y):
        return x + y + z
    else:
        return z

# A python function needs to be registered to be able to 
# call it from a MyriaL cell
MyriaPythonFunction(sillyUDF, DOUBLE_TYPE).register()



In [ ]:

    
# To see all functions currently registered
connection.get_functions()



In [ ]:

    
%%query
-- for your queries to run faster, its better to push the UDF to the smallest possible set of data
cosmo8 = scan(cosmo8_000970);
small = select * from cosmo8 limit 10;
res = select sillyUDF(x,y,z) as sillyPyRes from small;
store(res, garbage);



In [ ]:

    
%%query
-- same thing but as a MyriaL UDF
def silly(x,y,z):
    case
        when x > y
        then x + y + z
        else z
    end;
    
cosmo8 = scan(cosmo8_000970);
res = select silly(x,y,z) as sillyMyRes from cosmo8 limit 10;
store(res, garbage);



In [ ]:

    
from raco.types import DOUBLE_TYPE

def distance(tuplList):
    # note that libraries used inside the UDF need to be imported inside the UDF
    import math
    row = tuplList[0]
    x1 = row[0]
    y1 = row[1]
    z1 = row[2]
    x2 = row[3]
    y2 = row[4]
    z2 = row[5]
    
    return math.sqrt((x1-x2)**2 + (y1-y2)**2 + (z1-z2)**2)

MyriaPythonFunction(distance, DOUBLE_TYPE).register()



In [ ]:

    
print distance([(.1, .1, .1, .2, .2, .2)])



In [ ]:

    
eps = .0042



In [ ]:

    
%%query
-- here I am trying to find all points within eps distance of a given point
-- in order to avoid the expensive UDF distance() call on every point in the data,
-- I first filter the points by a simpler range query that immitates a bounding box
cosmo8 = scan(cosmo8_000970);
point = select * from cosmo8 where iOrder = 68649;
cube = select c.* from cosmo8 as c, point as p
    where abs(c.x - p.x) < @eps
    and abs(c.y - p.y) < @eps
    and abs(c.z - p.z) < @eps;
distances = select c.*, distance(c.x, c.y, c.z, p.x, p.y, p.z) as dist from cube as c, point as p;
res = select * from distances where dist < @eps;
store(res, garbage);



In [ ]:

    
%%query
cosmo8 = scan(cosmo8_000970);
point = select * from cosmo8 where iOrder = 68649;
cube = select c.* from cosmo8 as c, point as p
    where abs(c.x - p.x) < @eps
    and abs(c.y - p.y) < @eps
    and abs(c.z - p.z) < @eps;
onlyGases = select * from cube where type = 'gas';
distances = select c.*, distance(c.x, c.y, c.z, p.x, p.y, p.z) as dist from onlyGases as c, point as p;
res = select * from distances where dist < @eps;
store(res, garbage);

There is also special syntax for user defined aggregate functions, which use all of the rows to produce a single output, like a Reduce or Fold function pattern:

uda func-name(args) {
 initialization-expr(s);
 update-expr(s);
 result-expr(s);
};

Where each of the inner lines is a bracketed statement with an entry for each expression that you want to output.



In [ ]:

    
%%query
-- UDA example using MyriaL functions inside the UDA update line
def pickBasedOnValue2(val1, arg1, val2, arg2):
    case
        when val1 >= val2
        then arg1
        else arg2
    end;
    
def maxValue2(val1, val2):
    case
        when val1 >= val2
        then val1
        else val2
    end;

uda argMaxAndMax(arg, val) {
    [-1 as argAcc, -1.0 as valAcc];
    
    [pickBasedOnValue2(val, arg, valAcc, argAcc),
     maxValue2(val, valAcc)];
    
    [argAcc, valAcc];
};

cosmo8 = scan(cosmo8_000970);
res = select argMaxAndMax(iOrder, vx) from cosmo8;
store(res, garbage);



In [ ]:

    
# Previously when we wrote a UDF we expected the tuplList to only hold one row
# but UDFs that are used in a UDA could be given many rows at a time, so it is
# important to loop over all of them and keep track of the state/accumulator outside
# the loop, and then return the value that is expected by the update-expr line in the UDA.
from raco.types import LONG_TYPE
def pickBasedOnValue(tuplList):
    maxArg = -1
    maxVal = -1.0
    
    for tupl in tuplList:
        value1 = tupl[0]
        arg1 = tupl[1]
        value2 = tupl[2]
        arg2 = tupl[3]
        if (value1 >= value2):
            if (value1 >= maxVal):
                maxArg = arg1
                maxVal = value1
        else:
            if (value2 >= maxVal):
                maxArg = arg2
                maxVal = value2
    return maxArg

MyriaPythonFunction(pickBasedOnValue, LONG_TYPE).register()

from raco.types import DOUBLE_TYPE
def maxValue(tuplList):
    maxVal = -1.0
    
    for tupl in tuplList:
        value1 = tupl[0]
        value2 = tupl[1]
        if (value1 >= value2):
            if (value1 >= maxVal):
                maxVal = value1
        else:
            if (value2 >= maxVal):
                maxVal = value2
    return maxVal

MyriaPythonFunction(maxValue, DOUBLE_TYPE).register()



In [ ]:

    
%%query
-- UDA example using Python functions inside the UDA update line
uda argMaxAndMax(arg, val) {
    [-1 as argAcc, -1.0 as valAcc];
    
    [pickBasedOnValue(val, arg, valAcc, argAcc),
     maxValue(val, valAcc)];
    
    [argAcc, valAcc];
};
t = scan(cosmo8_000970);
s = select argMaxAndMax(iOrder, vx) from t;
store(s, garbage);



In [ ]:

    
%%query
-- of course, argMaxAndMax can be done much more simply:
c = scan(cosmo8_000970);
m = select max(vx) as mvx from c;
res = select iOrder, mvx from m,c where vx = mvx;
store(res, garbage);

Working with multiple snapshots

On the Myria demo cluster we only provide cosmo8_000970, but on a private cluster we could load in any number of snapshots to look for how things change over time.



In [ ]:

    
%%query
c8_000970 = scan(cosmo8_000970);
c8_000962 = scan(cosmo8_000962);

-- finding all gas particles that were destroyed between step 000962 and 000970
c1Gases = select iOrder from c8_000962 where type = 'gas';
c2Gases = select iOrder from c8_000970 where type = 'gas';

exist = select c1.iOrder from c1Gases as c1, c2Gases as c2 where c1.iOrder = c2.iOrder;

destroyed = diff(c1Gases, exist);
store(destroyed, garbage);



In [ ]:

    
%%query
c8_000970 = scan(cosmo8_000970);
c8_000962 = scan(cosmo8_000962);

-- finding all particles where some property changed between step 000962 and 000970
res = select c1.iOrder
    from c8_000962 as c1, c8_000970 as c2
    where c1.iOrder = c2.iOrder 
    and c1.metals = 0.0 and c2.metals > 0.0;
store(res, garbage);



In [ ]:

    
from IPython.display import HTML
HTML('''<script>
code_show_err=false; 
function code_toggle_err() {
 if (code_show_err){
 $('div.output_stderr').hide();
 } else {
 $('div.output_stderr').show();
 }
 code_show_err = !code_show_err
} 
$( document ).ready(code_toggle_err);
</script>
To toggle on/off output_stderr, click <a href="javascript:code_toggle_err()">here</a>.''')